I’ve been watching a lot of movies lately. Last month, I was on a Grisham adaptation kick. For younger readers who may be unfamiliar with John Grisham, the author wrote several bestselling courtroom dramas in the 90s. Many of his books are set in the Deep South, and they often involve a young idealistic lawyer who bravely confronts the corrupt white male-dominated institutions of a southern city (usually Memphis). Before my run of Grisham adaptations, I was watching a lot of John Hughes movies, many of which are set in the Midwest. Planes, Trains and Automobiles, directed by Hughes, takes the audience through rural Kansas and Missouri before concluding in the suburbs of Chicago, where many Hughes movies end. Anyway, my forays into Grisham adaptations and Hughes films led me to wonder - are fewer movies set in rural and small-town America today than in the 80s and 90s? In other words, are we in Kansas anymore?
This question led me to examine the broader question of how Hollywood film has changed over the past few decades. In this post, I look at several important trends related to this subject, including the changing relationship between genre and movie box office returns, shifts in the representation of men and women among movies’ top-billed actors, and a whole lot more. I conduct these analyses using data I collected through Wikipedia’s APIs. The data consists of 9712 movies released in the United States between 1980 and 2019. You can download it in its entirety from my GitHub page: github.com/datadiarist/large_files/blob/master/movie_metadata_tbl.rds.
Data
One challenge with collecting movie data from the internet is that the two largest sources of online movie data, Rotten Tomatoes and the Internet Movie Database, do not allow web scraping and have limited APIs. Wikipedia, on the other hand, has a comprehensive set of APIs that allows users to collect pretty much anything from the site, even content from previous versions of Wikipedia pages. The catch is that Wikipedia is a database that relies on user-generated content. One consequence of this is that the data is fairly unstandardized. For instance, movie pages provide box office information in many different formats - $100 million, 100,000,000, 100 million dollars, and so forth. I won’t go into the gory details of pulling and preprocessing this data here. It’s a lot of sprawling conditional statements and regular expression syntax. I may devote a post to the process of dealing with the many edge cases one encounters when working with Wikipedia data sometime in the future.
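To give a flavor of that preprocessing, here is a minimal sketch (in Python, for illustration - the actual pipeline behind the .rds file is not shown here) of a parser for heterogeneous box office strings. The real code handles many more edge cases than this.

```python
import re

def parse_box_office(raw):
    """Normalize a Wikipedia box-office string to a float (US dollars).

    Handles a few common infobox formats - "$100 million", "100,000,000",
    "100 million dollars". A simplified sketch, not the full pipeline.
    """
    text = raw.lower().replace(",", "").replace("$", "").strip()
    match = re.search(r"([\d.]+)\s*(billion|million)?", text)
    if not match:
        return None
    value = float(match.group(1))
    # Scale by the magnitude word if one is present; default to dollars.
    scale = {"billion": 1e9, "million": 1e6}.get(match.group(2), 1)
    return value * scale
```

In practice, each new batch of pages surfaces another format variant, which is where the sprawl of conditionals comes from.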
My sample of movies comes from a group of pages that all have the headline “List of American films of [a year]”. Each of these pages has tables with movie titles and links to their pages. By drawing from these, I collected a list of names and links for 9712 movies. Next, I pulled information from the infobox of each movie page. This appears in the upper-right corner of the page. Here’s what the infobox looks like for Next, a timeless cinematic masterpiece starring Nicolas Cage as a small-time magician who can see the future, but only two minutes into the future (exactly two minutes).
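As a sketch of this collection step, the following uses the public MediaWiki API’s `action=parse` endpoint to pull article links from one of these yearly list pages. The function names and the namespace filtering are my own illustrative choices; in practice, further filtering is needed to keep only links that point to film articles.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def article_links(parse_response):
    # Keep only article-namespace (ns == 0) link titles from an
    # action=parse response; categories, templates, etc. have other ns values.
    return [link["*"] for link in parse_response["parse"]["links"]
            if link["ns"] == 0]

def film_links_for_year(year):
    # Fetch the links on a "List of American films of <year>" page.
    params = urlencode({"action": "parse",
                        "page": f"List of American films of {year}",
                        "prop": "links",
                        "format": "json"})
    with urlopen(f"{API}?{params}") as resp:
        return article_links(json.load(resp))
```

Looping `film_links_for_year` over 1980-2019 and deduplicating is what yields the 9712-movie link list.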
For each movie, I collected the release date, box office, budget, runtime, directors, and top-billed actors from the infobox. I also gathered links to the pages of top-billed actors in each movie. Each actor page has categories, which provide information that can be used to infer actor gender and race/ethnicity. For example, take a look at the categories on the page of Next co-star Tory Kittles.
This tells us that Tory Kittles is a black male born in 1975.
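A rough sketch of how categories like these can be mapped to actor attributes is below. The category phrasings matched here are illustrative; real pages use many variants, so the actual inference rules need to be more forgiving.

```python
import re

def infer_from_categories(categories):
    # Heuristic mapping from Wikipedia category names to actor attributes.
    # The strings matched here are illustrative, not an exhaustive rule set.
    info = {"gender": None, "race": None, "birth_year": None}
    for cat in categories:
        low = cat.lower()
        # Check "actresses" first so "female" phrasings are not
        # shadowed by the "male actors" substring check.
        if "actresses" in low:
            info["gender"] = "female"
        elif "male actors" in low:
            info["gender"] = "male"
        if "african-american" in low:
            info["race"] = "black"
        match = re.match(r"(\d{4}) births", cat)
        if match:
            info["birth_year"] = int(match.group(1))
    return info
```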
Finally, I collected additional information on movies by examining the main body of each movie page. Most movie pages have a “Critical Reception” section that gives a movie’s Rotten Tomatoes score and the number of reviews on which this score is based. I also extracted movie genre from the introduction of each movie page. This bit of information almost always comes in the first sentence of the article, right before the first instance of the word “film” or “movie”. Finally, I used a set of rules for extracting where the film was set from the film synopsis.
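The genre-extraction rule can be sketched as a single regular expression. The handling of the year and nationality words here is a simplified assumption; the real rules need to cope with more lead-in patterns than this.

```python
import re

def extract_genre(first_sentence):
    # Capture the phrase between "is a/an" and the first "film" or "movie",
    # optionally skipping a year and the word "American" along the way.
    match = re.search(
        r"is an? (?:\d{4} )?(?:American )?(.*?)\s*(?:film|movie)\b",
        first_sentence)
    return match.group(1) if match else None
```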
Let’s have a look at the column names of the movie data.
This dataset has movie name, director and director link, genre, runtime, budget and box office information, Rotten Tomatoes review information, and release date information. After that, there is a set of columns that are nested lists containing data on top-billed actors in each movie. These nested lists contain actors’ names, links to their Wikipedia pages, race, gender, age, birthday, and more. Finally, there are several columns of movie-level actor data, including the proportion of top-billed actors who are black and the total number of women among top-billed actors.
Let’s start with some exploratory data analysis. By sorting the data by the box office variable and taking the top ten entries, we can see the top ten highest-grossing Hollywood movies according to the data.
[1] "Avengers: Endgame" "Avatar"
[3] "Titanic" "Star Wars: The Force Awakens"
[5] "Avengers: Infinity War" "Jurassic World"
[7] "The Lion King" "The Avengers"
[9] "Furious 7" "Avengers: Age of Ultron"
Sure enough, these are the highest-grossing movies of all time before adjusting for inflation. Let’s see how this list compares to an inflation-adjusted list of highest-grossing films.
[1] "Titanic" "Avatar"
[3] "Avengers: Endgame" "Star Wars: The Force Awakens"
[5] "E.T. the Extra-Terrestrial" "Avengers: Infinity War"
[7] "Jurassic Park" "Jurassic World"
[9] "The Avengers" "The Empire Strikes Back"
Adjusting for inflation vaults James Cameron to the top of the list with Titanic and Avatar. We also see more of the old guard of blockbuster directors, such as Spielberg and Lucas, in this inflation-adjusted list.
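The inflation adjustment behind this second list can be sketched as a simple year-to-multiplier lookup. The multipliers below are placeholders for illustration, not real CPI figures; the actual analysis would use a full CPI series.

```python
# Hypothetical multipliers to 2019 dollars -- placeholders, not real CPI data.
CPI_TO_2019 = {1997: 1.60, 2009: 1.19, 2019: 1.00}

def adjust_for_inflation(box_office, release_year):
    # Convert a nominal gross to 2019 dollars via a year -> multiplier lookup.
    return box_office * CPI_TO_2019[release_year]
```

Sorting on the adjusted values instead of the raw gross is what vaults the 90s-era blockbusters up the list.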
Let’s try to find some weirder kinds of outliers in this data. Turning to runtime, I pull the longest and shortest movies from the data.
The Cure for Insomnia is an 87-hour-long experimental film that consists of an artist reading a 4,080-page poem. It held the Guinness record for longest film before being supplanted by a non-American movie. Luxo Jr. is a 2-minute-long animated film released by Pixar in 1986. It used computer-based technology that was groundbreaking at the time and was the first CGI movie to be nominated for an Oscar (it was nominated for Best Animated Short).
We can also look at which actors appear most in the data.
It turns out that Samuel L. Jackson is the hardest working actor in show business, with 76 top billings since 1980. Jackson has this distinction on lock, holding a nine-film lead on Unbreakable co-star Bruce Willis.
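Counting top billings across the nested actor lists can be sketched like this. The "actors" key is a stand-in for the dataset’s nested-list columns, whose actual names may differ.

```python
from collections import Counter

def top_billed_counts(movies):
    # Tally how often each actor appears across movies' top-billed lists.
    # `movies` mirrors the nested-list structure of the dataset: one dict
    # per movie, with an "actors" key holding the top-billed names.
    counts = Counter()
    for movie in movies:
        counts.update(movie["actors"])
    return counts

movies = [{"actors": ["Samuel L. Jackson", "Bruce Willis"]},
          {"actors": ["Samuel L. Jackson"]}]
top_billed_counts(movies).most_common(1)  # [('Samuel L. Jackson', 2)]
```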
What other amusing outliers can we find in the data? How about the worst movies of all time? I get these by filtering the data to movies that have received at least 40 Rotten Tomatoes reviews and sorting by average Rotten Tomatoes score.
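That filter-and-sort step can be sketched as follows. The key names ("rt_score", "rt_reviews") are stand-ins for the dataset’s actual column names.

```python
def worst_movies(movies, min_reviews=40, n=10):
    # Keep movies with at least min_reviews Rotten Tomatoes reviews,
    # then return the n lowest-scoring ones.
    eligible = [m for m in movies if m["rt_reviews"] >= min_reviews]
    return sorted(eligible, key=lambda m: m["rt_score"])[:n]
```

The review-count floor matters: without it, the bottom of the list fills up with obscure films panned by a handful of critics.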
These movies all received either a 0% or 1% on Rotten Tomatoes (again, based on 40+ reviews). There are some derivative horror movies (One Missed Call, Alone in the Dark) and tasteless comedies (Disaster Movie, National Lampoon’s Gold Diggers) here. We also see movies that have ended careers (Roberto Benigni as Pinocchio in Pinocchio, Cuba Gooding Jr. in Daddy Day Camp). My favorite on this list is Dana Carvey’s incredibly misguided attempt to capitalize on the success of Mike Myers’s Austin Powers with The Master of Disguise.
There are many other interesting bits of information one can find in this data, and I encourage you to download the data yourself to answer some of your own questions. In the next section, I examine some broader patterns in the data.
Visualizing Trends in Hollywood Film
First, I look at how actors compare in terms of the profitability and critical success of their films. The figure below shows actors that have starred in more than 20 movies since 1980. The x-axis is the average Rotten Tomatoes score of an actor’s movies, and the y-axis is average profitability, measured as net box office returns adjusted for inflation. I’ve placed the actors in three groups. Red dots represent actors that have never been nominated for an Oscar, silver dots indicate actors that have been nominated but have never won an Oscar, and gold dots represent actors that have won an Oscar. The way to interpret this graph is that being farther from the axes, in the upper right part of the figure, is good, while being close to the axes, in the lower left part of the figure, is bad. You can hover your mouse over each dot to view the stats on that actor.
The figure shows a clear positive correlation between critical acclaim and box office returns. Also, the data is heteroskedastic - the spread in box office returns appears to increase as the mean Rotten Tomatoes score goes up. There’s evidence of a positive relationship between winning an Academy Award and being in positively reviewed and profitable movies. To see this clearly, click the “Nominee” label at the bottom of the figure to hide nominated actors and display only actors that have won an Oscar and actors who have not been nominated. This shows pretty clearly that winning an Oscar is correlated with both the critical reception and the profitability of an actor’s films.
The figure above also shows that a handful of actors have carved out a niche as “prestige” actors - while their movies may not make a lot of money, they are able to continue to get work on the critical acclaim that their movies receive. These actors can be found in the lower right-hand corner of the figure. They include Philip Seymour Hoffman (the most critically-acclaimed actor in the sample), Frances McDormand, Edward Norton, Denzel Washington, Jack Nicholson, Anjelica Huston, and many others. These actors generally do not appear in blockbusters. The lower-left quadrant of the figure, on the other hand, has actors whose movies do not garner praise from critics or make a lot of money. Unsurprisingly, most of these actors are no longer in large-budget Hollywood films. They include Brendan Fraser, Sharon Stone, Kevin Pollak, Cuba Gooding Jr., and John Travolta.
One could infer from this figure that Alan Rickman is the greatest actor of all time. He appears at the top right of the plot. His combination of Rotten Tomatoes score and mean box office returns is significantly higher than any other actor’s. His roles in Die Hard, Galaxy Quest, and the Harry Potter movies explain why he has the highest mean box office of any actor here. Shockingly, Rickman was never nominated for an Academy Award. Fittingly, the Guardian gave Rickman an “honorable mention” on their list of greatest actors to never have been nominated for an Oscar.
Having compared actors by critical acclaim and profitability, I now turn to the movies. The next figure shows trends in the kinds of movies that do well at the box office. Each point represents a movie, the x-axis gives the date of a movie’s release, and the y-axis indicates gross box office returns. Movies are grouped into six genres - Action, Adventure/Fantasy, Drama, Comedy, Animated, and Horror. I created these groupings using a set of rules for classifying genres. Movies with hybrid genres were categorized according to the genre that I believe should take precedence. For instance, animated comedies were placed in the Animated category. Movies that elude categorization (e.g. Titanic, which is described as an “epic romance and disaster film”) were filtered from the data in this figure. You can hover over a point to view the details for a specific movie. To filter by genre, click the genre label at the bottom of the figure.
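The precedence rule for hybrid genres can be sketched as an ordered keyword scan over the extracted genre phrase. The precedence order and keyword lists below are illustrative, not the exact rules used in the analysis.

```python
# Earlier buckets win ties: an "animated comedy" lands in Animated, not Comedy.
GENRE_PRECEDENCE = ["Animated", "Horror", "Action",
                    "Adventure/Fantasy", "Comedy", "Drama"]

KEYWORDS = {"Animated": ["animated", "animation"],
            "Horror": ["horror"],
            "Action": ["action"],
            "Adventure/Fantasy": ["adventure", "fantasy"],
            "Comedy": ["comedy"],
            "Drama": ["drama"]}

def assign_genre(genre_phrase):
    # Return the first bucket (in precedence order) whose keywords
    # appear in the phrase; None means the movie eludes categorization
    # and is filtered from the figure (e.g. Titanic's "epic romance
    # and disaster film").
    low = genre_phrase.lower()
    for bucket in GENRE_PRECEDENCE:
        if any(keyword in low for keyword in KEYWORDS[bucket]):
            return bucket
    return None
```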